Not too long ago, the standard advice for mixing spatial analysis and R was to do your spatial analysis tasks in a GIS and then do the statistical analysis of the results in R. Since the mid-2000s, however, there has been a rapid increase in the spatial data handling and analysis capability of R. The rgdal, rgeos, raster, and sp packages (through the very significant effort of the package authors) provided the foundation for doing GIS entirely within R. More recently, a new route for handling vector data has emerged in the sf package, and most current development for vector processing is focused on sf. Raster data is still mostly dealt with via the raster package, although this is changing rapidly as development on the terra and stars packages moves along quickly.
Before we dig into the packages, I want to cover some basics just to make sure everyone is on the same page with the two most common types of spatial data: vector and raster. It is important to understand this distinction because the tooling will be different when working with vector versus raster data. This is the case across all software and is not specific to R.
Vector data represents features with discrete locations and can be used to show things like points (e.g. species locations, fire hydrants), lines (e.g. roads, streams), or polygons (e.g. lakes, state boundaries). Raster data, on the other hand, breaks up the world into pixels. For each pixel we have a location and then a value for that pixel to indicate if it is a hydrant, a road, a stream, a lake, etc. See the example below:
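To make the distinction concrete, here is a minimal sketch (made-up coordinates and values, using the sf and raster packages we cover below) that builds one tiny example of each data type:

```r
library(sf)
library(raster)

# Vector: a discrete feature with an exact location (e.g. a fire hydrant)
hydrant <- st_point(c(-71.4, 41.8))

# Raster: the world broken up into pixels, each carrying a value
# (here a tiny 3 x 3 grid of made-up land cover codes)
lc <- raster(nrows = 3, ncols = 3, xmn = 0, xmx = 3, ymn = 0, ymx = 3)
values(lc) <- c(11, 11, 21, 11, 21, 21, 41, 41, 41)
```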
Before we get to the R packages, we need to discuss a bit about the underlying software libraries that make all of this GIS and spatial data analysis in R possible. The three primary pieces of software are the Geospatial Data Abstraction Library (GDAL), Geometry Engine - Open Source (GEOS), and PROJ. All are projects of the Open Source Geospatial Foundation and provide input/output for many different file types, topology and spatial operations, and coordinate transformation functionality, respectively. These libraries provide much of the functionality for many open source GIS and spatial analysis software packages.
In R, these three software libraries are accessed and made available via several packages. The primary ones are discussed below.
The sf package provides vector data handling via the Simple Features standard, an Open Geospatial Consortium and International Standards Organization standard for spatial data. In addition, sf provides a tidy spatial data format that allows for manipulation with the popular dplyr package.
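As a quick taste of that tidy workflow (a sketch with made-up points, not the workshop data), the usual dplyr verbs work directly on sf objects and the geometry column tags along automatically:

```r
library(sf)
library(dplyr)

# A tiny sf data frame: two point features with attributes
pts <- st_sf(
  name = c("site_a", "site_b"),
  elev = c(12, 148),
  geometry = st_sfc(st_point(c(-71.5, 41.6)),
                    st_point(c(-71.3, 41.9)),
                    crs = 4326)
)

# Filter on an attribute and add a derived column, just like a regular
# data frame -- the result is still an sf object
high_pts <- pts %>%
  filter(elev > 100) %>%
  mutate(elev_ft = elev * 3.28084)
```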
On Windows, getting sf added is no different than adding any other package that is on CRAN. If you are on Linux or macOS then you’ll need to put in a bit more effort (i.e. you need to have GDAL and GEOS installed). The best instructions for getting all this set up are at https://github.com/r-spatial/sf#installing. If you are on macOS, you will need Xcode and Homebrew installed. To do so, open up a Terminal window and type:
xcode-select --install
Once that finishes, we can get the latest version of Homebrew with this incantation:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Copy and paste this, hit return, and then you should be ready to get the libraries needed for sf. With Homebrew installed, we can now install what we need: gdal and geos.
brew install gdal
brew install geos
If you are on Linux, I’ll assume you can get gdal and geos on your own!
In theory, we should be good to install sf now:
install.packages("sf")
library("sf")
You get a lot with sf. It provides the ability to work with a very wide array of geometry types (e.g. points, lines, polygons, geometry collections, and more) plus many of the standard GIS functionalities (overlays, buffers, projections, etc.). One other package to be aware of that I have often used in conjunction with sf is lwgeom, which provides some additional functionality, but we won’t be looking at that one today.
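As a small taste of those operations, here is a sketch with made-up planar coordinates (no CRS, so no projection warnings) that buffers a point and intersects a line with the result:

```r
library(sf)

# A point and a line in simple planar coordinates
pt   <- st_sfc(st_point(c(5, 5)))
line <- st_sfc(st_linestring(rbind(c(0, 5), c(10, 5))))

# Buffer the point by 2 units, then clip the line to the buffer --
# a classic overlay workflow
buf  <- st_buffer(pt, dist = 2)
clip <- st_intersection(line, buf)

# The clipped piece is roughly the 4-unit chord through the circle
st_length(clip)
```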
For our raster data processing I will mostly be using the venerable raster package.
To install, just do:
install.packages("raster")
library("raster")
The raster package has been around for some time and provides a lot of raster functionality. It has been extended with other packages such as gdistance, which provides the ability to work with cost surfaces in R, and landsat, which provides tools to process Landsat data. Similar to how sf recently replaced some older packages, a few new packages are being developed that will likely eventually replace raster. Two to be aware of are terra and stars. The development of the terra package is being led by Robert Hijmans (the author of raster) and according to the repository promises to have a “very similar, but simpler, interface” and should be “much faster.” The stars package, led by Edzer Pebesma (also the lead for sf), is for multi-dimensional spatiotemporal arrays. It largely has a different niche than raster or terra, so it is perhaps a complement, but I’d expect there to be some degree of overlap.
Both packages can be installed from CRAN.
install.packages(c("stars","terra"))
library("stars")
library("terra")
Although we won’t be working with external GIS in this workshop, there are several packages that provide ways to move back and forth between another GIS and R. For example:

rgrass7: An interface to GRASS GIS. The same idea as the older spgrass6, but for the latest version of GRASS, GRASS 7.

There are many different packages available for creating maps in R. In the demos to come we will focus on two packages, ggplot2 and mapview, in addition to the default plotting that sf and raster provide.
To make sure these are installed:
install.packages("ggplot2")
library("ggplot2")
install.packages("mapview")
library("mapview")
We will use ggplot2 to build static maps, and use mapview for interactive maps. In addition to these, it probably pays to be aware of a few others that are commonly used for mapping.
leaflet: R wrapper for the Leaflet javascript library. The mapview package is actually a wrapper (in part) around this package.

tmap: Built for making thematic maps and has built-in support for scale bars, north arrows, etc.

ggmap: Built on top of ggplot2, ggmap allows for combining spatial data with map tiles from sources such as Google Maps or Stamen Maps.

In addition to the resources and links already provided, there are three really nice resources that are my first stops for learning how to use R for spatial data analysis.
note: All involved are aware of the confusing “r-spatial” vs “rspatial” naming. Probably not going to change anytime soon. They both serve as useful resources and the authors of both are, in spite of what the naming suggests, working fairly closely together to improve spatial analysis capabilities in R.
For the rest of our time, I will go over how to read spatial data into R. This all assumes that we have our datasets downloaded and available on our local machine. Let’s look at some data! I have downloaded both raster and vector data for the state. This includes two rasters, land cover and elevation, and one vector dataset, the RI state boundary.
Before we do this, let’s go ahead and make sure all of our packages are available to use.
library(sf)
library(raster)
library(terra)
library(stars)
library(ggplot2)
library(mapview)
library(dplyr)
Reading in data with sf is pretty straightforward. I have my data sitting in a “data” folder and need to point to that when reading in the data.
ri <- st_read("data/State_Boundary_1997.shp")
## Reading layer `State_Boundary_1997' from data source `C:\Users\JHollist\OneDrive - Environmental Protection Agency (EPA)\projects\pc_rspatial\data\State_Boundary_1997.shp' using driver `ESRI Shapefile'
## Simple feature collection with 354 features and 4 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -71.89239 ymin: 41.14656 xmax: -71.12052 ymax: 42.01888
## CRS: 4326
plot(ri$geometry)
I am specifying just the geometry here because the default plotting behavior for these sf objects is to create a plot for each of the attributes. That is not usually what I want, so I just specify the geometry column.
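A related convenience worth knowing: st_geometry() pulls out the geometry column no matter what it is named, so it is a safer habit than $geometry. A tiny sketch with made-up points:

```r
library(sf)

# Two point features with an attribute column
pts <- st_sf(id = 1:2,
             geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1))))

# plot(pts) would draw one panel per attribute; st_geometry() gives
# just the geometries, like $geometry but name-agnostic
plot(st_geometry(pts))
```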
One of the benefits of using sf is the speed. Let’s look at a biggish shape file with 1 million points!
#The old way
system.time(rgdal::readOGR("data","big"))
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\JHollist\OneDrive - Environmental Protection Agency (EPA)\projects\pc_rspatial\data", layer: "big"
## with 1000000 features
## It has 1 fields
## user system elapsed
## 156.51 78.44 131.94
#The sf way
system.time(st_read("data/big.shp"))
## Reading layer `big' from data source `C:\Users\JHollist\OneDrive - Environmental Protection Agency (EPA)\projects\pc_rspatial\data\big.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1000000 features and 1 field
## geometry type: POINT
## dimension: XY
## bbox: xmin: -71.03768 ymin: 41.05976 xmax: -69.09763 ymax: 43.00856
## CRS: 4326
## user system elapsed
## 71.63 17.97 51.93
The vector world in R has seen a lot of change over the last several years but is starting to stabilize around sf. The raster packages in R are where the vector packages were a couple of years ago. We will mostly use raster, but I want to show how to read in data with both terra and stars. First, to read in raster datasets we can use the raster package’s raster() function. For most raster formats it is simply a matter of pointing raster() to the file location.
ri_nlcd <- raster("data/nlcd.tif")
ri_nlcd
## class : RasterLayer
## dimensions : 4327, 2864, 12392528 (nrow, ncol, ncell)
## resolution : 30, 30 (x, y)
## extent : -8003022, -7917102, 5033997, 5163807 (xmin, xmax, ymin, ymax)
## crs : +proj=merc +lon_0=0 +lat_ts=0 +x_0=0 +y_0=0 +a=6378137 +b=6378137 +units=m +no_defs
## source : C:/Users/JHollist/OneDrive - Environmental Protection Agency (EPA)/projects/pc_rspatial/data/nlcd.tif
## names : nlcd
## values : 11, 95 (min, max)
plot(ri_nlcd)
To read this data in using terra:
ri_nlcd_terra <- rast("data/nlcd.tif")
ri_nlcd_terra
## class : SpatRaster
## dimensions : 4327, 2864, 1 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : -8003022, -7917102, 5033997, 5163807 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=merc +lon_0=0 +lat_ts=0 +x_0=0 +y_0=0 +a=6378137 +b=6378137 +units=m +no_defs
## data source : data/nlcd.tif
## names : nlcd
## min values : 11
## max values : 95
plot(ri_nlcd_terra)
To read this data in using stars:
ri_nlcd_stars <- read_stars("data/nlcd.tif")
ri_nlcd_stars
## stars object with 2 dimensions and 1 attribute
## attribute(s), summary of first 1e+05 cells:
## nlcd.tif
## 31 :28028
## 33 :14415
## 80 :13823
## 12 : 9933
## 13 : 9165
## 11 : 8896
## (Other):15740
## dimension(s):
## from to offset delta
## x 1 2864 -8003022 30
## y 1 4327 5163807 -30
## refsys point values
## x PROJCS["unnamed",GEOGCS["... FALSE NULL [x]
## y PROJCS["unnamed",GEOGCS["... FALSE NULL [y]
plot(ri_nlcd_stars)
So, not much difference other than how these are plotting and the information we get when we type the name. Is one faster or slower?
system.time(raster("data/nlcd.tif"))
## user system elapsed
## 0.02 0.00 0.02
system.time(rast("data/nlcd.tif"))
## user system elapsed
## 0.02 0.02 0.01
system.time(read_stars("data/nlcd.tif"))
## user system elapsed
## 0.80 0.06 0.48
Ah ha, stars loses. Maybe those stars objects are really tiny?
format(object.size(ri_nlcd), units = "Kb")
## [1] "15.4 Kb"
format(object.size(ri_nlcd_terra), units = "Kb")
## [1] "1.3 Kb"
format(object.size(ri_nlcd_stars), units = "Kb")
## [1] "48435.4 Kb"
Whoa! So, stars is slow and big. This actually is important. R usually reads all of your data into memory and then acts on it. This has several implications. Operations are usually faster when everything is loaded into memory, but there are trade-offs: it can take longer to read the data in, and you can only work with datasets that are smaller than your available memory.

What I suspect (but don’t know for certain, because I am mostly arm-waving as I write this) is that stars is reading the dataset into memory, which is why it is 1) bigger and 2) slower to read. I bet we will see some performance improvements with some operations though. Conversely, I know that raster only points to the data, and I’m pretty certain terra does too. This is fast for reading and lets us connect to big datasets, but might bite us when we go to work with the data.

In any event, I have very little experience using terra and stars, so I don’t know the ins and outs of which is faster when. Also, I am pretty sure there will be overlap in some functionality between the three packages, but also some functionality unique to each. Again, I am not familiar enough with the newer packages to know much about this. That being said, it is an active area of development and we at least know how to read data in!
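We can actually poke at this from R, because raster objects report whether their values are loaded. A small sketch with a throwaway file (written in raster’s native .grd format so no extra drivers are needed):

```r
library(raster)

# Build a small raster in memory and write it to a temporary file
r_mem <- raster(matrix(runif(100), nrow = 10))
f <- tempfile(fileext = ".grd")
writeRaster(r_mem, f)

# Pointing raster() at a file does NOT load the values...
r_disk <- raster(f)
inMemory(r_disk)    # FALSE: just a pointer to the file

# ...until we ask for them explicitly with readAll()
r_loaded <- readAll(r_disk)
inMemory(r_loaded)  # TRUE: values are now held in memory
```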
For the rest of the demo, we will stick with the old standard, raster.
There are two general categories of maps, static and interactive. Static maps are the type we would generate for publications, and interactive maps are what we want when working with data or making maps available via websites. We are only going to show the absolute basics for both.
ggplot2

Let’s use the state boundary to make a simple static map of the NLCD for Rhode Island.
# First let's match up coordinate reference systems
ri_prj <- st_transform(ri, st_crs(ri_nlcd))
ri_map <- ggplot(ri_prj) +
geom_sf(size = 1, color = "black", fill = "steelblue4", alpha = 0.3) +
labs(title = "Rhode Island") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ri_map
Hey look, a map! We are stopping with simple here, but I just wanted to let you know that you can create maps with ggplot2. I have another example of some older work of mine below that is a bit more complicated, showing that you can really create whatever you can imagine.
mapview

Let’s expand our map a bit, but now we want to add the land cover data and have our map be interactive. First let’s clean up our NLCD data, because right now we have NLCD for portions of Massachusetts and Connecticut.
# Crop and mask the NLCD to the state boundaries
ri_nlcd_mask <- mask(crop(ri_nlcd, ri_prj), ri_prj)
# Use mapview from the mapview package
ri_mv <- mapview(ri_prj, lwd = 1.5, color = "black", fill = NA) + mapview(ri_nlcd_mask)
ri_mv
So far we have seen how to read in our data, take a quick look at the resulting object in R, and create some simple maps. Now let’s play around with some very simple analysis. We will generate some samples, extract information for those samples, and calculate summary statistics. Let’s assume that we are interested in elevation by land cover type. Maybe not an interesting question, but we have the data handy so it’ll do.
First, let’s create some sample locations. We could do a simple random sample, but then we would likely not sample all of our land cover classes appropriately. So instead let’s try a stratified sample where we take the same number of samples in each land cover class. This will result in oversampling the less common classes, but for our purposes today it should be fine.
# Create a stratified sample of landcover with 30 samples per class
# The sp argument gives a point dataset as the results
ri_samples <- sampleStratified(ri_nlcd_mask, size = 30, sp = TRUE)
Remember that the raster package is an older package and was built when the sp package was used for vector data. We are now operating in the sf world, so we just want to make sure that our data are all of the same type. It is pretty easy to switch back and forth.
# convert sp samples to sf samples
ri_samples <- st_as_sf(ri_samples)
Lastly, let’s just be sure that this worked as expected. We can use the extract function from raster to extract out values for each point from the original land cover data. If the stratified sample worked it should have 30 points per class.
# Extract gets our land cover classes for each point, and the table function
# summarizes those.
ri_lulc_samples <- extract(ri_nlcd_mask, ri_samples)
table(ri_lulc_samples)
## ri_lulc_samples
## 11 21 22 23 24 31 41 42 43 52 71 81 82 90 95
## 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
Looks good! And we should at least map this, because, well, GIS!
mapview(ri_nlcd_mask) + mapview(ri_samples)
Now that we have our samples, we can go ahead and get the data for simple elevation analysis. We’ve seen how to do that already with the extract function. But first we need to read in our elevation data.
ri_elev <- raster("data/elev.tif")
ri_elev_samples <- extract(ri_elev, ri_samples)
This only gets us part of the way there. We now have an object that is just the elevation numbers. We need to figure out how to get our land cover information associated with the elevation data. Luckily, the order of this object will help. We already extracted the land cover; all we need to do is join these two objects together into a data frame.
# Create the data frame
ri_lulc_elev <- data.frame(lulc = ri_lulc_samples, elev = ri_elev_samples)
# Make this a special type of data frame, a tibble.
# Printing to the screen is cleaner
ri_lulc_elev <- as_tibble(ri_lulc_elev)
ri_lulc_elev
## # A tibble: 450 x 2
## lulc elev
## <dbl> <dbl>
## 1 11 45
## 2 11 105
## 3 11 0
## 4 11 2
## 5 11 89.3
## 6 11 9
## 7 11 36
## 8 11 27
## 9 11 100
## 10 11 11
## # ... with 440 more rows
Lastly, we can use the dplyr package to summarize the information in this data frame.
ri_lulc_elev <- group_by(ri_lulc_elev, lulc)
ri_lulc_elev_summary <- summarize(ri_lulc_elev,
avg_elev = mean(elev, na.rm = TRUE),
sd_elev = sd(elev, na.rm = TRUE),
min_elev = min(elev, na.rm = TRUE),
max_elev = max(elev, na.rm = TRUE))
ri_lulc_elev_summary
## # A tibble: 15 x 5
## lulc avg_elev sd_elev min_elev max_elev
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11 55.2 47.1 0 165
## 2 21 70.0 53.1 6.31 179
## 3 22 50.4 42.3 1.10 156.
## 4 23 41.0 28.5 2 114
## 5 24 32.8 35.2 1.80 141
## 6 31 49.9 51.7 0 150
## 7 41 104. 63.2 4.35 215.
## 8 42 81.5 45.9 1.80 199.
## 9 43 88.1 53.5 4.78 201.
## 10 52 70.7 66.4 1 219.
## 11 71 47.1 54.0 1 234.
## 12 81 64.1 52.4 6 199
## 13 82 65.1 46.4 5.54 173
## 14 90 47.6 40.3 3 145
## 15 95 25.9 43.8 0 132